Single cell analysis: Adams et al idiopathic pulmonary fibrosis (IPF)



Introduction

This is a automated script to [1] annotate cell types; [2] scale and normalize expression data; [3] remove dead cells, doublets, and outlying samples; and [3] perform dimension reduction and differential expression on expression data. This script can take in more than a single seurat objects.


Data for this analysis was taken from here


Figure 1) Legend from Adams et al idiopathic pulmonary fibrosis (IPF): A) Overview of experimental design. (i) Disease lung explants and unused donor lungs collected. (ii) Lungs dissociated to single-cell suspension. (iii) Droplet-based scRNA-seq library preparation (iv) sequencing. (v) Exploratory analysis. (vi) Spatial localization with IHC. (B) Uniform Manifold Approximation and Projection (UMAP) representation of 312,928 cells from 32 IPF, 18 COPD, and 28 control donor lungs; each dot represents a single cell, and cells are labeled as one of 38 discrete cell varieties. AT, alveolar type; cDC, classical dendritic cell; pDC, plasmacytoid dendritic cell; M, macrophage; NK, natural killer; ILC, innate lymphoid cell; PNEC, pulmonary neuroendocrine cell; SMC, smooth muscle cell; VE, vascular endothelial. (C) Heat map of marker genes for all 38 identified cell types, categorized into four broad cell categories. Each cell type is represented by the top five genes ranked by false discovery rate (FDR) adjusted P value of a Wilcoxon rank sum test between the average expression per subject value for each cell type against the other average subject expression of the other cell types in their respective grouping. Each column represents the average expression value for one subject, hierarchically grouped by disease status and cell type. Gene expression values are unity normalized from 0 to 1 across rows within each categorical cell type group.



1 - Quality control and standardization file (QCStanAnn)

In this section, we perform quality control and standardization of single cell RNA-seq datasets.

NOTE: Make sure the data is the unedited seurat file.

### ---------------------LOAD THE SEURAT OBJECT---------------------------------
if (params$is_list == TRUE & params$split_by_group == FALSE) {
  seurats <- readRDS(file = params$dir_seurat)

  for (i in 1:length(seurats)) {
    Idents(seurats[[i]]) <- seurats[[i]]@meta.data[, params$qcs_indent_var]
    seurats[[i]][["percent.mt"]] <- PercentageFeatureSet(object = seurats[[i]], pattern = "^MT|^mt")
  }

} else {

  seurat_analysed <- readRDS(file = params$dir_seurat)

  # seurat_analysed@meta.data[,"orig.ident"] = gsub("BIOKEY", "Sample", seurat_analysed@meta.data[,"orig.ident"])
  # seurat_analysed@meta.data[,"Cell"] = gsub("BIOKEY", "Sample", seurat_analysed@meta.data[,"Cell"])
  # seurat_analysed@meta.data[,"patient_id"] = gsub("BIOKEY", "Sample", seurat_analysed@meta.data[,"patient_id"])

  ### Change ident to original idents
  Idents(seurat_analysed) <- seurat_analysed@meta.data[, params$qcs_indent_var]

  ### ------------------ADD mito.gene % IN METADATA--------------------------------
  seurat_analysed[["percent.mt"]] <- PercentageFeatureSet(object = seurat_analysed, pattern = "^MT|^mt")

  ### Set identity
  if (params$group_name == "none") {
    seurat_analysed@meta.data$params_group <- "No group"
  } else {
    seurat_analysed@meta.data$params_group <- seurat_analysed@meta.data[, params$group_name]
  }
}


### ---------FOR MULTIPLE BATCHES THAT NEED TO BE PROCESSED SEPARATELY----------
### This is often seen when CD45+ and CD45- are ran separately. The two extremely
### different cell types can have varying degrees of doublets and mitochondria
### gene expression. As such, they should be processed separately. If ran in a
### single batch, then this is not necessary, but we'll still put it into a list.
if (params$is_list == FALSE & params$split_by_group == TRUE) {
  seurats <- SplitObject(seurat_analysed, split.by = params$group_name)
}

if (params$is_list == FALSE & params$split_by_group == FALSE) {
  seurats <- list(seurat_analysed)
  names(seurats) <- params$group_name
}

### ------------------DEFINE THE COLORS FOR REPORT------------------------------
colz <- list()
for (i in 1:length(seurats)) {
  colz[[i]] <- get_colorz(length(unique(seurats[[i]]@meta.data[, params$qcs_indent_var])),
    palette1,
    unique(seurats[[i]]@meta.data[, params$qcs_indent_var]))
}

1.1 - Quality Control

In section 1.1, we remove dead cells, doublets, and samples that are flagged as outliers. In addition, we can remove genes that only show up in a set number of cells. For example, if the gene only shows up in 1/100,000 cells, it is worth removing.

Cell level (pre removal)

Here we will remove dead cells, doublets, and cells with low molecule counts. We do so by setting cut-offs for the following 3 parameters:

  1. nFeature_RNA is the number of genes detected in each cell.
  2. nCount_RNA is the total number of molecules detected within a cell
  3. percent.mt is the percent of features that are mitochondrial genes

NOTE

There are several things to consider. For one, a much higher number of gene counts of molecule counts may indicate the presence of doublets, especially for droplet-based scRNA-seq like 10X. In addition, high mitochondrial gene percentages may indicate dead cells, which is common in plate-based scRNA-seq like smart-seq.

### Plot the three parameters mentioned above
pre_plots <- list()
for (i in 1:length(seurats)) {
  p1 <- eval(substitute(

    plot_grid(

      violin_me(seurats[[i]], "nFeature_RNA", colorz = colz[[i]], themes$QB_theme_90_small(), legend = "none") + ggtitle("Gene count", subtitle = paste0(names(seurats[i]))),
      violin_me(seurats[[i]], "nCount_RNA", colorz = colz[[i]], themes$QB_theme_90_small(), legend = "none") + ggtitle("UMI count", subtitle = paste0(names(seurats[i]))),
      violin_me(seurats[[i]], "percent.mt", colorz = colz[[i]], themes$QB_theme_90_small(), legend = "none") + ggtitle("% MT genes", subtitle = paste0(names(seurats[i]))),

      ncol = 1  ### For column count

    ),
    list(i = i)))
  pre_plots[[i]] <- p1  # add each plot into plot list
}


### Plot it out
# cowplot::plot_grid(plotlist = pre_plots, nrow = length(seurats), ### Fix beginning here
#                    ncol = 1,
#                    vjust = 0.5)

#### Plot it
for (i in 1:length(pre_plots)) {
  print(glue::glue("#### {as.vector(names(seurats[i]))}\n"))

  title_plots <- pre_plots[[i]]
  print(title_plots)
  cat("\n \n \n")
}

none

### -----------------------------CELL FILTER-------------------------------------
### Filter out cells that have QC metrics +/- away from 3*MAD:
### nFeature_RNA metric > 200-3*MAD(nFeature_RNA) and < median(nFeature_RNA)+3*MAD(nFeature_RNA)
### nCount_RNA < median(nCountRNA)+3*MAD(nCountRNA)
### percent.mt < median(percent.mt)+3*MAD(percent.mt)
for (i in 1:length(seurats)) {
  seurats[[i]] <- subset(x = seurats[[i]],
    subset = nFeature_RNA >  max(params$qcs_std_nFeature_RNA, median(seurats[[i]]@meta.data$nFeature_RNA, na.rm = TRUE) - 3 * mad(seurats[[i]]@meta.data$nFeature_RNA, constant = 1.4826, na.rm = TRUE)) &
      nFeature_RNA < median(seurats[[i]]@meta.data$nFeature_RNA, na.rm = TRUE) + 3 * mad(seurats[[i]]@meta.data$nFeature_RNA, constant = 1.4826, na.rm = TRUE) &
      nCount_RNA <  median(seurats[[i]]@meta.data$nCount_RNA, na.rm = TRUE) + 3 * mad(seurats[[i]]@meta.data$nCount_RNA, constant = 1.4826, na.rm = TRUE) &
      percent.mt < median(seurats[[i]]@meta.data$percent.mt, na.rm = TRUE) + 3 * mad(seurats[[i]]@meta.data$percent.mt, constant = 1.4826, na.rm = TRUE)
  )
}



Cell level (post removal)

Here we will remove dead cells, doublets, and cells with low molecule counts. We do so by setting cut-offs for the following 3 parameters:

  1. nFeature_RNA is the number of genes detected in each cell.
  2. nCount_RNA is the total number of molecules detected within a cell
  3. percent.mt is the percent of features that are mitochondrial genes

NOTE

There are several things to consider. For one, a much higher number of gene counts of molecule counts may indicate the presence of doublets, especially for droplet-based scRNA-seq like 10X. In addition, high mitochondrial gene percentages may indicate dead cells, which is common in plate-based scRNA-seq like smart-seq.

post_plots <- list()
for (i in 1:length(seurats)) {
  p1 <- eval(substitute(

    plot_grid(

      violin_me(seurats[[i]], "nFeature_RNA", colorz = colz[[i]], themes$QB_theme_90_small(), legend = "none") + ggtitle("Gene count", subtitle = paste0(names(seurats[i]))),
      violin_me(seurats[[i]], "nCount_RNA", colorz = colz[[i]], themes$QB_theme_90_small(), legend = "none") + ggtitle("UMI count", subtitle = paste0(names(seurats[i]))),
      violin_me(seurats[[i]], "percent.mt", colorz = colz[[i]], themes$QB_theme_90_small(), legend = "none") + ggtitle("% MT genes", subtitle = paste0(names(seurats[i]))),

      ncol = 1  ### For column count

    ),
    list(i = i)))
  post_plots[[i]] <- p1  # add each plot into plot list
}


#### Plot it
for (i in 1:length(post_plots)) {
  print(glue::glue("#### {as.vector(names(seurats[i]))}\n"))

  title_plots <- post_plots[[i]]
  print(title_plots)
  cat("\n \n \n")
}

none



Gene level

Here we remove genes that only show up in a certain number of cells. You can hit the “code” button to reveal the underlying code to remove low count genes. There may be genes with zero counts across all cells too. These reduce the average expression for a cell and should be removed, as well. For our data we choose to keep only genes which are expressed in 10 or more cells.


This subsection is code only. Hit the “code” button to reveal the code.

### Create the new seurat object
seurats <- lapply(X = seurats, FUN = function(x) {

  CreateSeuratObject(LayerData(x, assay = "RNA", layer = "counts"),
    meta.data = x@meta.data,
    min.cells = 10,
    min.features = 10

  )
})



Sample level

Here we remove samples that are flagged as outliers. This will remove whole samples if the sample meets the following criteria:

  • Is sample UMI median below 25% quantile of the UMI count of the entire dataset?
  • Is sample gene count below 25% quantile of the gene count of the entire dataset?
  • Is sample mt.percent above 75% quantile of the mt.percent of whole dataset?

NOTE

We use quantiles as opposed to 3 x MAD because the bottom 3 x MAD will go into a negative value, which does nothing.

### Plot the density of gene expression values by sample
m.melts <- list()
for (i in 1:length(seurats)) {
  m.melts[[i]] <- seurats[[i]]@meta.data |>
    select(params$qcs_patient_id, nCount_RNA, nFeature_RNA, percent.mt) |>
    gather(key = "variable", value = "value", -params$qcs_patient_id)
}

### ---------------------STATISTIC CALCULATIONS----------------------------------
### Calculate some statistics to be used for plotting
count.meds <- lapply(X = m.melts, FUN = function(x) {

  ddply(x |> filter(variable == "nCount_RNA"),
    params$qcs_patient_id, summarise, med = median(value))

})

feature.meds <- lapply(X = m.melts, FUN = function(x) {

  ddply(x |> filter(variable == "nFeature_RNA"),
    params$qcs_patient_id, summarise, med = median(value))

})

mt.meds <- lapply(X = m.melts, FUN = function(x) {

  ddply(x |> filter(variable == "percent.mt"),
    params$qcs_patient_id, summarise, med = median(value))

})

### Calculate overall medians
overall.stats <- list()
for (i in 1:length(m.melts)) {
  overall.stats[[i]] <- m.melts[[i]] |> group_by(variable) |>
    summarize(median = median(value, na.rm = TRUE),
      quantile = quantile(value, na.rm = TRUE))
  overall.stats[[i]]$quantile_name <- c(rep(c("0%", "25%", "50%", "75%", "100%"), 3))
}

### Define variable for plotting, and the cutoffs to use
variables <- c("nCount_RNA", "nFeature_RNA", "percent.mt")
calculations <- c("count_quants", "feature_quants", "mt_quants")
cutoffs <- list()
for (i in 1:length(seurats)) {
  cutoffs[[i]] <- c(subset(overall.stats[[i]], variable == "nCount_RNA" & quantile_name == "25%")$quantile,
    subset(overall.stats[[i]], variable == "nFeature_RNA" & quantile_name == "25%")$quantile,
    subset(overall.stats[[i]], variable == "percent.mt" & quantile_name == "75%")$quantile)
  names(cutoffs[[i]]) <- c("nCount_RNA", "nFeature_RNA", "percent.mt")
}
### -------------------------END STATISTICS CALC--------------------------------
### Plot the stuff
ggplotz <- list()
for (i in 1:length(m.melts)) {

  p1 <- eval(substitute(
    plot_grid(

      sample_boxplot(dat = m.melts[[i]] |> filter(variable == variables[1]), varz = variables[1],  theme = themes$QB_theme_90_small(), colorz = colz[[i]],
        line = cutoffs[[i]]["nCount_RNA"]),
      sample_boxplot(dat = m.melts[[i]] |> filter(variable == variables[2]), varz = variables[2],  theme = themes$QB_theme_90_small(), colorz = colz[[i]],
        line = cutoffs[[i]]["nFeature_RNA"]),
      sample_boxplot(dat = m.melts[[i]] |> filter(variable == variables[3]), varz = variables[3],  theme = themes$QB_theme_90_small(), colorz = colz[[i]],
        line = cutoffs[[i]]["percent.mt"]),
      ncol = 1

    ),
    list(i = i)))
  ggplotz[[i]] <- p1  # add each plot into plot list
}

#### Plot it
for (i in 1:length(ggplotz)) {
  print(glue::glue("#### {as.vector(names(seurats[i]))}\n"))

  title_plots <- ggplotz[[i]]
  print(title_plots)
  cat("\n \n \n")
}

none



Removed samples (if any)

### Remove samples that meet all 3 of the following criteria:
d <- list()
keepme <- list()
for (i in 1:length(count.meds)) {
  d[[i]] <- cbind(count.meds[[i]], feature.meds[[i]]$med, mt.meds[[i]]$med)
  colnames(d[[i]]) <- c("Samples", "Count_medians", "Feature_medians", "Percent_mito")

  #### Check if any of the 4 variables are outliers
  d[[i]]$`Less than UMI count Q25` <- d[[i]]$Count_medians < ifelse(params$qcs_count_cutoff == "NA",
    cutoffs[[i]]["nCount_RNA"],
    params$qcs_count_cutoff)
  d[[i]]$`Less than gene count Q25` <- d[[i]]$Feature_medians < ifelse(params$qcs_count_cutoff == "NA",
    cutoffs[[i]]["nFeature_RNA"],
    params$qcs_feature_cutoff)
  d[[i]]$`Greater than %mt Q75` <- d[[i]]$Percent_mito > ifelse(params$qcs_count_cutoff == "NA",
    cutoffs[[i]]["percent.mt"],
    percent.mt_cutoff)

  ### Annotate the samples in which all 3 variables are outliers
  d[[i]]$remove_me <- d[[i]]$`Less than UMI count Q25` == "TRUE" &
    d[[i]]$`Less than gene count Q25` == "TRUE" &
    d[[i]]$`Greater than %mt Q75` == "TRUE"

  ### For annotation purposes
  d[[i]]$Group <- names(seurats[i])

  keepme[[i]] <- d[[i]]$Samples[which(d[[i]]$remove_me == "FALSE")]

}

### Unlist it for printing
stacked_dataframe <- do.call(rbind, d) |>
  select(Group,
    Samples,
    Count_medians,
    Feature_medians,
    Percent_mito,
    `Less than UMI count Q25`,
    `Less than gene count Q25`,
    `Greater than %mt Q75`,
    remove_me) |>
  rename(`Median UMI count` = Count_medians,
    `Median gene count` = Feature_medians,
    `Median %mt genes` = Percent_mito,
    `Selected for removal` = remove_me)

### Table it
stacked_dataframe |> render_dt()



### Remove from seurat
for (i in 1:length(seurats)) {
  Idents(seurats[[i]]) <- seurats[[i]]@meta.data[, params$qcs_patient_id]
  seurats[[i]] <- subset(seurats[[i]], idents = keepme[[i]])
}



1.2 - Standardize

In this section, we normalize the data, regress out confounding variables, and scale the data. Each tab provides more detail.

Normalize the data

This subsection is code only. Hit the “code” button to reveal the code. For data normalization, we use NormalizeDatas() that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. Normalized values are stored in [[“RNA”]]@data.

seurats <- lapply(X = seurats, FUN = function(x) {
  NormalizeData(x,
    normalization.method = "LogNormalize",
    scale.factor = 10000)
})



Regress out confounding variables

Calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others). These genes help to highlight biological signal in single-cell datasets.

By default, FindVariableFeatures returns 2,000 features per dataset, but can be changed to any number. These may be used in downstream analysis, like PCA.

NOTE

This will be done for each tissue type individually, so make sure to run through them one at a time.

### Find the top 2000 most variable genes for PCA analysis
seurats <- lapply(X = seurats, FUN = function(x) {
  FindVariableFeatures(object = x, selection.method = "vst",
    nfeatures = 2000)
})

### Plot variable features with labels
### Identify the 10 most highly variable genes
top10 <- list()
labels <- list()
LabelPoints <- list()
for (i in 1:length(seurats)) {
  top10[[i]] <- head(VariableFeatures(seurats[[i]]), 10)
  labels[[i]] <- VariableFeaturePlot(seurats[[i]])
  LabelPoints(plot = labels[[i]], points = top10[[i]], repel = TRUE)
}

### Generate a list of plots based on the above information
var_plots <- list()
for (i in 1:length(seurats)) {
  p1 <- eval(substitute(

    LabelPoints(plot = labels[[i]], points = top10[[i]], repel = TRUE),
    list(i = i)))
  var_plots[[i]] <- p1  # add each plot into plot list
}


#### Plot it out
for (i in 1:length(var_plots)) {
  print(glue::glue("#### {as.vector(names(seurats[i]))}\n"))

  title_plots <- var_plots[[i]]
  print(title_plots)
  cat("\n \n \n")
}

none

Variable features with labels. Identiy the 10 most highly variable genes.

Variable features with labels. Identiy the 10 most highly variable genes.

Scale the data

This subsection is code only. Hit the “code” button to reveal the code. Next, we apply a linear transformation (‘scaling’) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData() function:

  1. Shifts the expression of each gene, so that the mean expression across cells is 0
  2. Scales the expression of each gene, so that the variance across cells is 1. This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
  3. The results of this are stored in [[“RNA”]]@scale.data

NOTE

By default ScaleData() will only scale the identified 2,000 genes unless specified using the "features = all.genes" argument.

if (params$qcs_scale_genes == "all") {
  seurats <- lapply(X = seurats, FUN = function(x) {
    ScaleData(object = x,
      features = rownames(seurat_analysed),
      verbose = FALSE,
      vars.to.regress = c(unlist(strsplit(params$qcs_vars_regress,
        split = ",", fixed = T))))
    print("Scaled using all genes")
  })
} else {
  seurats <- lapply(X = seurats, FUN = function(x) {
    ScaleData(object = x,
      verbose = FALSE,
      vars.to.regress = c(unlist(strsplit(params$qcs_vars_regress,
        split = ",", fixed = T))))
  })
  print("Scaled using top variable genes")

}
## [1] "Scaled using top variable genes"



1.3 - Identify the cell cycle phase for each cell

Here we infer the cell cycle phase in which the each cell is suspected to be in based on gene expression levels of cell cycle markers.

Score the cell cycle

This subsection is code only. Hit the “code” button to reveal the code.

if (params$organism == "human") {
  ### [1] Create G2M marker gene list (i.e. genes associated with the G2M phase of the
  ### cell cycle using Seurat's built-in cc.genes (cell cycle) genes list
  g2m.genes <- cc.genes$g2m.genes
  s.genes <- cc.genes$s.genes
  g2m <- rownames(seurats[[1]])[rownames(seurats[[1]]) %in% g2m.genes]

} else if (params$organism == "mouse") {
  cc_file <- RCurl::getURL("https://raw.githubusercontent.com/hbc/tinyatlas/master/cell_cycle/Mus_musculus.csv")
  cell_cycle_genes <- read.csv(text = cc_file)

  ah <- AnnotationHub::AnnotationHub()

  # Access the Ensembl database for organism
  ahDb <- AnnotationHub::query(ah,
    pattern = c("Mus musculus", "EnsDb"),
    ignore.case = TRUE)

  # Acquire the latest annotation files
  id <- ahDb %>%
    GenomicRanges::mcols() %>%
    rownames() %>%
    tail(n = 1)

  # Download the appropriate Ensembldb database
  edb <- ah[[id]]

  # Extract gene-level information from database
  annotations <- genes(edb,
    return.type = "data.frame")

  # Select annotations of interest
  annotations <- annotations %>%
    dplyr::select(gene_id, gene_name, seq_name, gene_biotype, description)


  # Get gene names for Ensembl IDs for each gene
  cell_cycle_markers <- dplyr::left_join(cell_cycle_genes, annotations, by = c("geneID" = "gene_id"))

  # Acquire the S phase genes
  s.genes <- cell_cycle_markers %>%
    dplyr::filter(phase == "S") %>%
    dplyr::pull("gene_name")

  # Acquire the G2M phase genes
  g2m.genes <- cell_cycle_markers %>%
    dplyr::filter(phase == "G2/M") %>%
    dplyr::pull("gene_name")
}


### [2] Calculate G2M marker module score.
seurats <- lapply(X = seurats, FUN = function(x) {
  CellCycleScoring(x,
    s.features = s.genes, ### Genes associated with s-phase
    g2m.features = g2m.genes, ### Genes associated with G2M phase
    set.ident = TRUE)
})

1.4 - Write to file

The QC is done, the file is save on ~/public-data/quantbio/scRNA-seq/IPF_adams_2020/Seurat_IPF_adams_QCS.rds.



2 - Annotation

In this section, we perform cell type annotation of single cell RNA-seq datasets. This chunk supports cell type annotation for human or moiuse cells only, and uses SingleR to annotate cells. This identifies marker genes from the reference and uses them to compute assignment scores (based on the Spearman correlation across markers) for each cell in the test dataset against each label in the reference. The label with the highest score is the assigned to the test cell, possibly with further fine-tuning to resolve closely related labels. We prune low-soring cells in this section, as well.

2.1 - Annotate with singleR

This subsection is code only. Hit the “code” button to reveal the code. Reference data are load with Ensembl annotations. ensembl = F indicates to not convert row names to ensembl IDs. The selected annotation package is HumanAtlas

if (params$anno_annotation == "HumanAtlas") {
  ref.data <- HumanPrimaryCellAtlasData(ensembl = FALSE)
} else if (params$anno_annotation == "ImmGenData") {
  ref.data <- ImmGenData(ensembl = FALSE)
}


### Assign the cell types. For this step, I'm parellelizing it with
### the BiocParallel library. Parallelizing may not be necessary or helpful,
### but its here.
cells <- lapply(X = seurats, FUN = function(x) {
  SingleR(test = LayerData(x, assay = "RNA", layer = "counts"),
    ref = ref.data, labels = ref.data$label.main,
    BPPARAM = MulticoreParam(detectCores()))
})

### Add to seurat
for (i in 1:length(seurats)) {
  seurats[[i]]$SingleR <- cells[[i]]$labels
}

2.2 - Perform QC checks on new cell labels

1 - Heatmap before pruning

Here we prune out cell labels with low quality cell-type-assignment. Low delta values are caused by:

  1. ambiguous assignments with closely related reference labels
  2. incorrect assignments that match poorly to all reference labels.

We first check the scores and the number of low-quality labels

### Generate a list of plots based on the above information
plots <- list()
for (i in 1:length(seurats)) {
  plotScoreHeatmap(cells[[i]], main = names(seurats[i]))
}
All cell types identified and their scores (pre pruned)

All cell types identified and their scores (pre pruned)



2 - Delta score before pruning

### Generate a list of plots based on the above information
plots <- list()
for (i in 1:length(seurats)) {
  p1 <- eval(substitute(

    plotDeltaDistribution(cells[[i]],
      ncol = 5,
      size = 0.5) +
      themes$QB_theme() +
      ylab("Delta score") +
      xlab("Cell labels") +
      ggtitle(names(seurats[i])),
    list(i = i)))
  plots[[i]] <- p1  # add each plot into plot list
}

#### Plot it out
for (i in 1:length(plots)) {
  print(glue::glue("#### {as.vector(names(seurats[i]))}\n"))

  title_plots <- plots[[i]]
  print(title_plots)
  cat("\n \n \n")
}

none

Number of low-quality labels per cell assignment (pre pruned)

Number of low-quality labels per cell assignment (pre pruned)



3 - Count of cells to be pruned

Table of pruned cell counts:

tabs <- list()
for (i in 1:length(cells)) {
  tabs[[i]] <- table(table(is.na(cells[[i]]$pruned.labels)) |>
    as.data.frame() |>
    dplyr::mutate(Group = names(seurats[i])) |>
    dplyr::rename(Pruned = Var1, Count = Freq))

  tabs[[i]] <- as.data.frame(tabs[[i]]) |>
    dplyr::filter(Freq > 0) |>
    dplyr::select(Group, Pruned, Count)
  tabs[[i]] <- tabs[[i]] |> render_dt()

}


#### Plot it
for (i in 1:length(tabs)) {
  print(glue::glue("#### {as.vector(names(seurats[i]))}\n"))

  title_plots <- tabs[[i]]
  print(title_plots)
  cat("\n \n \n")
}

none



## Prune out the low-quality reads
prune <- list()
keep <- list()
for (i in 1:length(cells)) {
  prune[[i]] <- pruneScores(cells[[i]],
    nmads = 3,
    min.diff.med = -Inf,
    min.diff.next = 0,
    get.thresholds = FALSE
  )
  prune[[i]] <- which(prune[[i]] == "TRUE")
  keep[[i]] <- cells[[i]][-prune[[i]], ]
}

4 - Heatmap after pruning

### Re-plot to ensure prune worked
plots <- list()
for (i in 1:length(seurats)) {
  plotScoreHeatmap(keep[[i]], main = names(seurats[i]))
}
All cell types identified and their scores (after prune)

All cell types identified and their scores (after prune)



5 - Delta score after pruning

plots <- list()
for (i in 1:length(seurats)) {
  p1 <- eval(substitute(

    plotDeltaDistribution(keep[[i]],
      ncol = 5,
      size = 0.5) +
      themes$QB_theme() +
      ylab("Delta score") +
      xlab("Cell labels") +
      ggtitle(names(seurats[i])),
    list(i = i)))
  plots[[i]] <- p1  # add each plot into plot list
}

#### Plot it out
for (i in 1:length(plots)) {
  print(glue::glue("#### {as.vector(names(seurats[i]))}\n"))

  title_plots <- plots[[i]]
  print(title_plots)
  cat("\n \n \n")
}

none

Number of low-quality labels per cell assignment (after prune)

Number of low-quality labels per cell assignment (after prune)



6 - Prune cells from seurat

This subsection is code only. Hit the “code” button to reveal the code.

### Prune the cells in the subsetted seurat objects
for (i in 1:length(seurats)) {
  seurats[[i]] <- seurats[[i]][, -prune[[i]]]
  seurats[[i]]$SingleR <- factor(seurats[[i]]$SingleR)
}



2.4 - Merge similar cell types


Merge ambiguous cell types into:

  1. Fibroblasts
  2. Macrophages
  3. B_cells
  4. Others

If the cells are below 10, they are merge to “other”.

NOTE:

Neurons are also added to this group because neurons aren’t an anticipated cell type. Also, if the cells are B or T, cell subtypes merge with B and T cells, respectively.

### So not to lose the low count b cell subtypes, collapse those first
freqs <- list()
others <- list()
metas <- list()
for (i in 1:length(seurats)) {
  seurats[[i]]$SingleR2 <- fct_collapse(seurats[[i]]$SingleR,
    B_cell = params$anno_Bcells,
    HSC = params$anno_HCS,
    Fibroblasts = params$anno_fibroblast)

  ### Generate frequency tables
  metas[[i]] <- seurats[[i]]@meta.data
  freqs[[i]] <- data.frame(tabyl(metas[[i]]$SingleR2, sort = TRUE))
  colnames(freqs[[i]]) <- c("Cells", "n", "Proportion")
  freqs[[i]]$Percent <- freqs[[i]]$Proportion * 100
  freqs[[i]] <- freqs[[i]][order(freqs[[i]]$n, decreasing = TRUE), ]

  ### I used the above "tabyl" function to get the proportion of cells out of
  ### all total cells in case we want to set a cut-off using a proportion rather
  ### than total count.
  others[[i]] <- as.vector(freqs[[i]]$cells[which(freqs[[i]]$n < params$anno_ctoff)])
  seurats[[i]]$SingleR2 <- fct_collapse(seurats[[i]]$SingleR2,
    Others = c(others,
      params$anno_Others))
  Idents(seurats[[i]]) <- seurats[[i]]$SingleR2
}

Count cells types



2.5 - Identify tumor cells

This subsection is code only. Hit the “code” button to reveal the code.This step utilizes inferCNV to infer copy number variations that will allow us to identify malignant cells.

NOTE: For Now Skip



Replace the identified malignant cells types in the original seurat

2.4 - Write to file

The annotation is done, the file is save on ~/public-data/quantbio/scRNA-seq/IPF_adams_2020/Seurat_IPF_adams_ANNO.rds.



3 - Differential Expression analysis

In this section, we run dimension reduction (PCA, tSNE, and UMAP), clustering, and identify differentially expressed genes in the scRNA-seq data. This section is important if you want a final seurat object that includes dimension reduction in the object, we well as identifying markers for each cell type in the dataset.

3.1 - PCA analyses

Perform linear dimensional reduction via PCA. We use different methods to run diagnostics on the Principal Components (PCs).

PCA loadings

This plot is to identify genes contributing to desired PCAs

### Generate a list of plots based on the above information
plots <- list()
for (i in 1:length(seurats)) {
  p1 <- eval(substitute(

    VizDimLoadings(seurats[[i]],
      dims = 1:4,
      reduction = "pca"),
    list(i = i)))
  plots[[i]] <- p1  # add each plot into plot list
}


#### Plot it out
for (i in 1:length(plots)) {
  print(glue::glue("#### {as.vector(names(seurats[i]))}\n"))

  title_plots <- plots[[i]]
  print(title_plots)
  cat("\n \n \n")
}

none

Identify genes contributing to desired PCAs

Identify genes contributing to desired PCAs



PCA by cell types

plots <- list()
for (i in 1:length(seurats)) {
  p1 <- eval(substitute(

    DimPlot(seurats[[i]], reduction = "pca",
      cols = cell.cols,
      pt.size = 0.5) +
      ggtitle("PCA of cell types defined by SingleR") +
      themes$QB_theme() +
      ylab("PC 2") +
      xlab("PC 1") +
      theme(axis.title.x = element_text()),
    list(i = i)))
  plots[[i]] <- p1  # add each plot into plot list
}

#### Plot it out
for (i in 1:length(plots)) {
  print(glue::glue("#### {as.vector(names(seurats[i]))}\n"))

  title_plots <- plots[[i]]
  print(title_plots)
  cat("\n \n \n")
}

none



PCA by features

plots <- list()
for (i in 1:length(seurats)) {
  p1 <- eval(substitute(

    FeaturePlot(seurats[[i]],
      reduction = "pca",
      features = c("nFeature_RNA",  "percent.mt"),
      raster = FALSE),
    list(i = i)))
  plots[[i]] <- p1  # add each plot into plot list
}

#### Plot it out
for (i in 1:length(plots)) {
  print(glue::glue("#### {as.vector(names(seurats[i]))}\n"))

  title_plots <- plots[[i]]
  print(title_plots)
  cat("\n \n \n")
}

none



PCA by phase

for (i in 1:length(seurats)) {
  Idents(seurats[[i]]) <- seurats[[i]]@meta.data$Phase
}

plots <- list()
for (i in 1:length(seurats)) {
  p1 <- eval(substitute(

    DimPlot(seurats[[i]], reduction = "pca",
      cols = phase.colors,
      pt.size = 0.5) +
      ggtitle("PCA of cells colored by cell cycle phase") +
      themes$QB_theme() +
      ylab("PC 2") +
      xlab("PC 1") +
      theme(axis.title.y = element_text(),
        axis.title.x = element_text()),
    list(i = i)))
  plots[[i]] <- p1  # add each plot into plot list
}

#### Plot it out
for (i in 1:length(plots)) {
  print(glue::glue("#### {as.vector(names(seurats[i]))}\n"))

  title_plots <- plots[[i]]
  print(title_plots)
  cat("\n \n \n")
}

none

### Return identities
for (i in 1:length(seurats)) {
  Idents(seurats[[i]]) <- seurats[[i]]@meta.data[, params$dex_ident]
}

Dimension heatmap

This method allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses.

#### Plot it out
for (i in 1:length(seurats)) {
  print(glue::glue("#### {as.vector(names(seurats[i]))}\n"))
  title_plots <- print(DimHeatmap(seurats[[i]], dims = 1:15, cells = 500, balanced = TRUE))
  cat("\n \n \n")
}

none

NULL



Elbow plot

We look for an ‘elbow’ in the plot (usually around PC6-7). However, even PCs up to usually 12 still maintain a high standard deviation.

This is supported by the heatmaps, suggesting that the majority of true signal is captured in the first 12 PCs.

### Make some more lists
pcas <- list()
eigValues <- list()
varExplained <- list()
PCs <- list()
ndims <- list()

### Determine which PCs explain 80% of the variance
for (i in 1:length(seurats)) {

  pcas[[i]] <- seurats[[i]]@reductions$pca
  eigValues[[i]] <- (pcas[[i]]@stdev)^2 ## EigenValues
  varExplained[[i]] <- eigValues[[i]] / sum(eigValues[[i]])
  PCs[[i]] <- sum(ifelse(cumsum(varExplained[[i]]) <= params$dex_variance_explained, 1, 0))
  if (PCs[[i]] < 5) {
    PCs[[i]] <- 5
  }
  ndims[[i]] <- ifelse(PCs[[i]] >= 20, PCs[[i]] + 1, 30)

}

### Make some more plots
plots <- list()
for (i in 1:length(seurats)) {
  p1 <- eval(substitute(

    ElbowPlot(seurats[[i]], ndims = ndims[[i]]) +
      ggplot2::geom_vline(xintercept = PCs[[i]], linetype = "dashed", color = "blue"),
    list(i = i)))
  plots[[i]] <- p1  # add each plot into plot list
}

#### Plot it out
for (i in 1:length(plots)) {
  print(glue::glue("#### {as.vector(names(seurats[i]))}\n"))

  title_plots <- plots[[i]]
  print(title_plots)
  cat("\n \n \n")
}

none

Identify most influential PCs

Identify most influential PCs



We use PCs that capture the 80% of the variance. This is a good rule of thumb for most datasets, and is a good starting point for most analyses.

3.2 - Confounding variables

Categorical and numerical vars vs. PCs

Variables that are associated with the top 10 PCs. These variables may be confounding variables and worth considering in downstream analyses.

### Calculate PCs
mats <- list()
pca.list <- list()

### Calculate PC statistics (again)
for (i in 1:length(seurats)) {
  mats[[i]] <- as.data.frame(LayerData(seurats[[i]], assay = "RNA", layer = "counts"))
  pca.list[[i]] <- list(All = FactoMineR::PCA(t(mats[[i]]), ncp = 10, graph = FALSE, scale = FALSE))
}

### Calculate covariates
anova.results <- list()
glm.results <- list()
results <- list()
df.melt <- list()
for (i in 1:length(pca.list)) {
  ### [1] Run ANOVA of categorical variables
  anova.results[[i]] <- anova_PCA(obj_name = "All",
    pca.df = pca.list[[i]]$All,
    metadata = seurats[[i]]@meta.data,
    merge_ID = "Cell",
    cat_var = params$dex_categorical_vars,
    numPCs = 10)
  rownames(anova.results[[i]]) <- anova.results[[i]]$Factor

  ### [2] Run GLM for continuous variables
  glm.results[[i]] <- glm_PCA(obj_name = "All",
    pca.list = pca.list[[i]]$All,
    metadata = seurats[[i]]@meta.data,
    merge_ID = "Cell",
    num_var = params$dex_numerical_vars,
    numPCs = 10)
  rownames(glm.results[[i]]) <- glm.results[[i]]$Factor

  ### [3] Melt results for plotting
  results[[i]] <- rbind(anova.results[[i]], glm.results[[i]])
  df.melt[[i]] <- melt(results[[i]])
  df.melt[[i]] <- cbind(df.melt[[i]], PC = as.numeric(gsub("Dim|_adj_pvalue", "", df.melt[[i]]$variable)))
  df.melt[[i]]$ajp <- ifelse(df.melt[[i]]$value <= 0.001, "<=0.001", df.melt[[i]]$value)
  df.melt[[i]]$ajp <- ifelse(df.melt[[i]]$ajp <= 0.01 &  df.melt[[i]]$ajp > 0.001, "<=0.01",  df.melt[[i]]$ajp)
  df.melt[[i]]$ajp <- ifelse(df.melt[[i]]$ajp <= 0.05 &  df.melt[[i]]$ajp > 0.01, "<=0.05",  df.melt[[i]]$ajp)
  df.melt[[i]]$ajp <- ifelse(df.melt[[i]]$ajp > 0.05, "NS",  df.melt[[i]]$ajp)
}

### Plot
plots <- list()
for (i in 1:length(seurats)) {
  p1 <- eval(substitute(

    ggplot(df.melt[[i]], aes(x = PC, y = Factor)) +
      geom_tile(aes(fill = ajp), color = "white") +
      scale_fill_manual(values = c("<=0.001" = "#4c1d4bff",
        "<=0.01" = "#bd1655ff",
        "<=0.05" = "#f47f58ff",
        "NS" = "#FAEBddff"),
      name = "FDR adjusted p-value") +
      labs(title = "Covariates vs. PCs",
        subtitle = paste0("Data is from: ", names(seurats[i]))) +
      themes$QB_theme() +
      ylab(""),
    list(i = i)))
  plots[[i]] <- p1  # add each plot into plot list
}

#### Plot it out
for (i in 1:length(plots)) {
  print(glue::glue("#### {as.vector(names(seurats[i]))}\n"))

  title_plots <- plots[[i]]
  print(title_plots)
  cat("\n \n \n")
}



3.3 - uMAP and tSNE visualisations


NOTE

It is encouraged users to repeat downstream analyses with a different number of PCs (10, 15, or even 50!). As the results often do not differ dramatically. It is also advised users to err on the higher side when choosing this parameter. Because downstream analyses with only 5 PCs do significantly and adversely affect results.


Regarding clustering methods, SeuratV3 embeds cells in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar feature expression patterns, and then attempts to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’.

We first construct a KNN graph based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). This step is performed using the FindNeighbors()

To cluster the cells, we next apply modularity optimization techniques implemented in the function FindClusters() from the Seurat package.

Available options are the Louvain algorithm (default) or SLM Blondel et al. 2008, that iteratively group cells together, with the goal of optimizing the standard modularity function. Optimal resolution for 3K cells often ranges from 0.4 to 1.2 and increases the larger the dataset.

The option selected for this analysis is Original Louvain algorithm.

UMAP colored by cell type

### Calculate UMAP, tSCNE, and identify clusters
for (i in 1:length(seurats)) {
  seurats[[i]] <- FindNeighbors(seurats[[i]], dims = 1:PCs[[i]])
  seurats[[i]] <- FindClusters(seurats[[i]],
    algorithm = params$dex_cluster_algorithm)
  seurats[[i]] <- RunUMAP(seurats[[i]], dims = 1:PCs[[i]])
  seurats[[i]] <- RunTSNE(seurats[[i]], dims = 1:PCs[[i]], check_duplicates = FALSE)
}

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 274515 Number of edges: 7588427

Running Louvain algorithm… Maximum modularity in 10 random starts: 0.8894 Number of communities: 26 Elapsed time: 254 seconds

### Make some more plots
plots <- list()
for (i in 1:length(seurats)) {
  p1 <- eval(substitute(

    DimPlot(seurats[[i]], reduction = "umap",
      group.by = "SingleR2",
      cols = cell.cols,
      #        shuffle = TRUE,
      #        repel=TRUE,
      #        label=TRUE, label.size = 2.5,label.box = TRUE,
      pt.size = 0.5) +
      ggtitle("UMAP of cell types defined by SingleR") +
      themes$QB_theme() +
      ylab("UMAP 2") +
      xlab("UMAP 1") +
      theme(axis.title.y = element_text(),
        axis.title.x = element_text()),
    list(i = i)))
  plots[[i]] <- p1  # add each plot into plot list
}

#### Plot it out
for (i in 1:length(plots)) {
  print(glue::glue("#### {as.vector(names(seurats[i]))}\n"))

  title_plots <- plots[[i]]
  print(title_plots)
  cat("\n \n \n")
}

none



UMAP colored by cell cycle phase

### Color by cell cycle phase
### Make some more plots
plots <- list()
for (i in 1:length(seurats)) {
  p1 <- eval(substitute(

    DimPlot(seurats[[i]], reduction = "umap",
      group.by = "Phase",
      cols = phase.colors,
      #        shuffle = TRUE,
      #        repel=TRUE,
      #        label=TRUE, label.size = 2.5,label.box = TRUE,
      pt.size = 0.5) +
      ggtitle("UMAP of cell types colored by cell cycle phase") +
      themes$QB_theme() +
      ylab("UMAP 2") +
      xlab("UMAP 1") +
      theme(axis.title.y = element_text(),
        axis.title.x = element_text()),
    list(i = i)))
  plots[[i]] <- p1  # add each plot into plot list
}

#### Plot it out
for (i in 1:length(plots)) {
  print(glue::glue("#### {as.vector(names(seurats[i]))}\n"))

  title_plots <- plots[[i]]
  print(title_plots)
  cat("\n \n \n")
}

none

3.4 Differential expression per cluster

Seurat can help you find markers that define clusters via differential expression. By default, it identifies positive and negative markers of a single cluster (specified in ident.1), compared to all other cells. FindAllMarkers() automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.

The min.pct argument requires a feature to be detected at a minimum percentage in either of the two groups of cells, and the thresh.test argument requires a feature to be differentially expressed (on average) by some amount between the two groups. You can set both of these to 0, but with a dramatic increase in time - since this will test a large number of features that are unlikely to be highly discriminatory. Currently, the values for min.pct and thresh.test are 0.25 and 0, respectivelly.

As another option to speed up these computations, max.cells.per.ident can be set. This will downsample each identity class to have no more cells than whatever this is set to. While there is generally going to be a loss in power, the speed increases can be significant and the most highly differentially expressed features will likely still rise to the top.

if (file.exists(params$dex_mat_save_name)) {
  DE_seurats <- readRDS(params$dex_mat_save_name)
  text <- paste0("DEX data source: ", "DE file read successfully.")
} else {
  ### Find markers for every cluster compared to all remaining cells
  DE_seurats <- list()
  for (i in 1:length(seurats)) {
    Idents(seurats[[i]]) <- seurats[[i]]@meta.data[, params$dex_ident]
    DE_seurats[[i]] <- FindAllMarkers(seurats[[i]],
      only.pos = params$dex_only_positive,
      min.pct = params$dex_min_pct,
      logfc.threshold = params$dex_logfc_threshold)
    DE_seurats[[i]]$params_group <- names(seurats[i])
  }

  ### Write to file
  write_rds(DE_seurats, file = params$dex_mat_save_name)
  text <- paste0("DEX data source: ", "DE file created and data written successfully.")
}

text



DEGs table

### Print the heatmaps
for (i in 1:length(seurats)) {
  print(glue::glue("#### {as.vector(names(seurats[i]))}\n"))
  DE_seurats[[i]] |> render_dt()
  cat("\n \n \n")
}



DEGs heatmap

### Make a list of a list. These are the items necessary to make a heatmap
heatmap_stuff <- list()
for (i in 1:length(DE_seurats)) {
  heatmap_stuff[[i]] <- LetsMakeAHeatmap(dex_data = DE_seurats[[i]],
    pval = 0.05,
    top_n = 5,
    log2FC = 0.0,
    sample_size = 250,
    seurat = seurats[[i]],
    variable = "SingleR2")
}

### Make the annotation objects
has <- list()
ras <- list()
morecols <- list()
has <- for (i in 1:length(heatmap_stuff)) {
  ### 1) Column annotations
  has[[i]] <- HeatmapAnnotation(`TEMPUS cell type` = heatmap_stuff[[i]]$seurat_sample$SingleR2,
    col = list(`Group` = cell.cols
    ),
    annotation_name_side = "left",
    border = TRUE,
    simple_anno_size = unit(0.35, "cm"))

  ### 2) Row annotations
  ras[[i]] <- rowAnnotation(`Fold-change` = anno_barplot(heatmap_stuff[[i]]$top$avg_log2FC,
    add_numbers = F),
  `Group source` = heatmap_stuff[[i]]$top$cluster,
  `Dup.` = heatmap_stuff[[i]]$top$Duplicated,
  col = list(`Group source` = cell.cols,
    `Dup.` = c("Yes" = "yellow",
      "No" = "purple")
  ),
  gp = gpar(col = "black"),
  annotation_name_rot = 90,
  show_legend = c(`Sample source` = FALSE)
  )

  ### 3) Color scheme 3 for expression levels
  morecols[[i]] <- colorRamp2(c(0, max(as.matrix(heatmap_stuff[[i]]$mat)),
    max(heatmap_stuff[[1]]$mat)), c("white", "orangered2", "black"))
}


### Plot the heatmap for tumor tissue
maps <- list()
for (i in 1:length(heatmap_stuff)) {

  maps[[i]] <- Heatmap(heatmap_stuff[[i]]$mat,
    name = "Expression",
    col = morecols[[i]],
    border = T,
    border_gp = gpar(col = "darkgrey", lty = 1),

    ### Column stuff
    top_annotation = has,
    show_column_names = F,
    column_split = factor(heatmap_stuff[[i]]$seurat_sample$SingleR2,
      levels = heatmap_stuff[[i]]$lvls$Var1),

    cluster_column_slices = F,
    column_gap = unit(1.0, "mm"),
    column_title_side = "top",
    column_title = paste0(names(seurats[i]), " cells only"),

    ### Row stuff
    left_annotation = ras[[i]],
    show_row_names = F,
    row_split = factor(heatmap_stuff[[i]]$top$cluster,
      levels = heatmap_stuff[[i]]$lvls$Var1),
    cluster_row_slices =  F,
    row_gap = unit(1.0, "mm"),
    row_title = "Top DEGs per celltype",
    row_names_gp = gpar(fontsize = 7)
  )

}

### Print the heatmaps
for (i in 1:length(seurats)) {
  print(glue::glue("#### {as.vector(names(seurats[i]))}\n"))
  title_plots <- print(maps[[i]])
  cat("\n \n \n")
}

3.5 - Write seurat to file

The annotation is done, the file is save on ~/public-data/quantbio/scRNA-seq/IPF_adams_2020/Seurat_IPF_adams_FULLYPROCESSED.rds.



Information about the Session

Input parameters

for (param in names(params)) {
  cat(param, ": ", params[[param]], "\n")
}
## analysed_dataset :  Adams et al idiopathic pulmonary fibrosis (IPF) 
## more_authors :  Rosa Hernansaiz, Vincent Perez, and Jen Modliszewski 
## dir_seurat :  ~/public-data/quantbio/scRNA-seq/IPF_adams_2020/Seurat_IPF_adams.rds 
## is_list :  FALSE 
## split_by_group :  FALSE 
## group_name :  none 
## QCS :  TRUE 
## organism :  human 
## qcs_indent_var :  Subject_Identity 
## qcs_eval_cell :  TRUE 
## qcs_std_nFeature_RNA :  200 
## qcs_vars_regress :  nCount_RNA percent.mt 
## qcs_ncells :  10 
## qcs_nfeatures :  10 
## qcs_patient_id :  Subject_Identity 
## qcs_count_cutoff :  NA 
## qcs_feature_cutoff :  NA 
## qcs_mt_cutoff :  NA 
## qcs_eval_sample :  TRUE 
## qcs_normalize :  TRUE 
## qcs_variation_eval :  TRUE 
## qcs_scale :  TRUE 
## qcs_scale_genes :  default 
## qcs_cc :  TRUE 
## qcs_save :  TRUE 
## qcs_save_name :  ~/public-data/quantbio/scRNA-seq/IPF_adams_2020/Seurat_IPF_adams_QCS.rds 
## ANNO :  TRUE 
## anno_annotation :  HumanAtlas 
## anno_singleR :  TRUE 
## anno_Others :  Embryonic_stem_cells GMP Hepatocytes iPS_cells Keratinocytes MSC Neuroepithelial_cell Neurons 
## anno_cutoff :  10 
## anno_Bcells :  Pro-B_cell_CD34+ Pre-B_cell_CD34- B_cell 
## anno_HCS :  HSC_-G-CSF HSC_CD34+ 
## anno_fibroblast :  Fibroblasts Chondrocytes Smooth_muscle_cells Tissue_stem_cells 
## anno_save :  TRUE 
## anno_save_name :  ~/public-data/quantbio/scRNA-seq/IPF_adams_2020/Seurat_IPF_adams_ANNO.rds 
## DEx :  TRUE 
## dex_confounding :  FALSE 
## dex_categorical_vars :  timepoint Phase 
## dex_numerical_vars :  nCount_RNA nFeature_RNA percent.mt 
## dex_ident :  SingleR2 
## dex_cluster_algorithm :  1 
## dex_variance_explained :  0.8 
## dex_run_dex :  FALSE 
## dex_write_dex_seurat :  TRUE 
## dex_min_pct :  0.25 
## dex_only_positive :  TRUE 
## dex_logfc_threshold :  0 
## dex_p_val_adj :  0.1 
## dex_seurat_save_name :  ~/public-data/quantbio/scRNA-seq/IPF_adams_2020/Seurat_IPF_adams_FULLYPROCESSED.rds 
## dex_mat_save_name :  ~/public-data/quantbio/scRNA-seq/IPF_adams_2020/Seurats_IPF_adams_DEx_output.csv

Session information

Information about R, the OS and attached or loaded packages
pander::pander(sessionInfo(), compact = FALSE)

R version 4.4.0 (2024-04-24)

Platform: x86_64-pc-linux-gnu

locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C

attached base packages:

  • grid
  • parallel
  • stats4
  • stats
  • graphics
  • grDevices
  • utils
  • datasets
  • methods
  • base

other attached packages:

  • circlize(v.0.4.16)
  • ggpubr(v.0.6.0)
  • readxl(v.1.4.3)
  • RColorBrewer(v.1.1-3)
  • scales(v.1.3.0)
  • annotate(v.1.82.0)
  • XML(v.3.99-0.17)
  • AnnotationDbi(v.1.66.0)
  • ComplexHeatmap(v.2.20.0)
  • pheatmap(v.1.0.12)
  • viridis(v.0.6.5)
  • viridisLite(v.0.4.2)
  • celldex(v.1.14.0)
  • BiocParallel(v.1.38.0)
  • janitor(v.2.2.0)
  • biomaRt(v.2.60.1)
  • glmnet(v.4.1-8)
  • Matrix(v.1.7-1)
  • SingleR(v.2.6.0)
  • SummarizedExperiment(v.1.34.0)
  • Biobase(v.2.64.0)
  • GenomicRanges(v.1.56.2)
  • GenomeInfoDb(v.1.40.1)
  • IRanges(v.2.38.1)
  • S4Vectors(v.0.42.1)
  • BiocGenerics(v.0.50.0)
  • MatrixGenerics(v.1.16.0)
  • matrixStats(v.1.4.1)
  • lubridate(v.1.9.3)
  • forcats(v.1.0.0)
  • stringr(v.1.5.1)
  • dplyr(v.1.1.4)
  • purrr(v.1.0.2)
  • tidyr(v.1.3.1)
  • tibble(v.3.2.1)
  • tidyverse(v.2.0.0)
  • conflicted(v.1.2.0)
  • styler(v.1.10.3)
  • ggplot2(v.3.5.1)
  • plyr(v.1.8.9)
  • cowplot(v.1.1.3)
  • readr(v.2.1.5)
  • Seurat(v.5.1.0)
  • SeuratObject(v.5.0.2)
  • sp(v.2.1-4)
  • data.table(v.1.16.2)
  • here(v.1.0.1)

loaded via a namespace (and not attached):

  • R.methodsS3(v.1.8.2)
  • progress(v.1.2.3)
  • goftest(v.1.2-3)
  • DT(v.0.33)
  • Biostrings(v.2.72.1)
  • HDF5Array(v.1.32.1)
  • vctrs(v.0.6.5)
  • spatstat.random(v.3.3-2)
  • digest(v.0.6.37)
  • png(v.0.1-8)
  • shape(v.1.4.6.1)
  • gypsum(v.1.0.1)
  • ggrepel(v.0.9.6)
  • deldir(v.2.0-4)
  • parallelly(v.1.40.1)
  • MASS(v.7.3-60.2)
  • reshape2(v.1.4.4)
  • rmdformats(v.1.0.4)
  • httpuv(v.1.6.15)
  • foreach(v.1.5.2)
  • withr(v.3.0.2)
  • ggrastr(v.1.0.2)
  • xfun(v.0.49)
  • survival(v.3.6-4)
  • memoise(v.2.0.1)
  • ggbeeswarm(v.0.7.2)
  • zoo(v.1.8-12)
  • GlobalOptions(v.0.1.2)
  • pbapply(v.1.7-2)
  • R.oo(v.1.27.0)
  • Formula(v.1.2-5)
  • prettyunits(v.1.2.0)
  • KEGGREST(v.1.44.1)
  • promises(v.1.3.2)
  • httr(v.1.4.7)
  • rstatix(v.0.7.2)
  • globals(v.0.16.3)
  • fitdistrplus(v.1.2-1)
  • rhdf5filters(v.1.16.0)
  • rhdf5(v.2.48.0)
  • rstudioapi(v.0.17.1)
  • UCSC.utils(v.1.0.0)
  • miniUI(v.0.1.1.1)
  • generics(v.0.1.3)
  • curl(v.6.0.1)
  • zlibbioc(v.1.50.0)
  • ScaledMatrix(v.1.12.0)
  • polyclip(v.1.10-7)
  • GenomeInfoDbData(v.1.2.12)
  • ExperimentHub(v.2.12.0)
  • SparseArray(v.1.4.8)
  • xtable(v.1.8-4)
  • doParallel(v.1.0.17)
  • evaluate(v.1.0.1)
  • S4Arrays(v.1.4.1)
  • BiocFileCache(v.2.12.0)
  • hms(v.1.1.3)
  • bookdown(v.0.41)
  • irlba(v.2.3.5.1)
  • colorspace(v.2.1-1)
  • filelock(v.1.0.3)
  • rcartocolor(v.2.1.1)
  • ROCR(v.1.0-11)
  • reticulate(v.1.40.0)
  • spatstat.data(v.3.1-4)
  • magrittr(v.2.0.3)
  • lmtest(v.0.9-40)
  • snakecase(v.0.11.1)
  • later(v.1.4.1)
  • lattice(v.0.22-6)
  • spatstat.geom(v.3.3-4)
  • future.apply(v.1.11.3)
  • scattermore(v.1.2)
  • RcppAnnoy(v.0.0.22)
  • pillar(v.1.9.0)
  • nlme(v.3.1-164)
  • iterators(v.1.0.14)
  • compiler(v.4.4.0)
  • beachmat(v.2.20.0)
  • RSpectra(v.0.16-2)
  • stringi(v.1.8.4)
  • tensor(v.1.5)
  • crayon(v.1.5.3)
  • abind(v.1.4-8)
  • bit(v.4.5.0.1)
  • codetools(v.0.2-20)
  • BiocSingular(v.1.20.0)
  • crosstalk(v.1.2.1)
  • bslib(v.0.8.0)
  • alabaster.ranges(v.1.4.2)
  • GetoptLong(v.1.0.5)
  • plotly(v.4.10.4)
  • mime(v.0.12)
  • splines(v.4.4.0)
  • Rcpp(v.1.0.13-1)
  • fastDummies(v.1.7.4)
  • dbplyr(v.2.5.0)
  • sparseMatrixStats(v.1.16.0)
  • cellranger(v.1.1.0)
  • knitr(v.1.49)
  • blob(v.1.2.4)
  • utf8(v.1.2.4)
  • clue(v.0.3-66)
  • BiocVersion(v.3.19.1)
  • listenv(v.0.9.1)
  • DelayedMatrixStats(v.1.26.0)
  • ggsignif(v.0.6.4)
  • tzdb(v.0.4.0)
  • pkgconfig(v.2.0.3)
  • tools(v.4.4.0)
  • cachem(v.1.1.0)
  • R.cache(v.0.16.0)
  • RSQLite(v.2.3.9)
  • DBI(v.1.2.3)
  • fastmap(v.1.2.0)
  • rmarkdown(v.2.29)
  • ica(v.1.0-3)
  • broom(v.1.0.7)
  • AnnotationHub(v.3.12.0)
  • sass(v.0.4.9)
  • patchwork(v.1.3.0)
  • BiocManager(v.1.30.25)
  • dotCall64(v.1.2)
  • carData(v.3.0-5)
  • RANN(v.2.6.2)
  • alabaster.schemas(v.1.4.0)
  • farver(v.2.1.2)
  • yaml(v.2.3.10)
  • cli(v.3.6.3)
  • leiden(v.0.4.3.1)
  • lifecycle(v.1.0.4)
  • uwot(v.0.2.2)
  • backports(v.1.5.0)
  • timechange(v.0.3.0)
  • gtable(v.0.3.6)
  • rjson(v.0.2.23)
  • ggridges(v.0.5.6)
  • progressr(v.0.15.1)
  • jsonlite(v.1.8.9)
  • RcppHNSW(v.0.6.0)
  • bit64(v.4.5.2)
  • Rtsne(v.0.17)
  • alabaster.matrix(v.1.4.2)
  • spatstat.utils(v.3.1-1)
  • jquerylib(v.0.1.4)
  • alabaster.se(v.1.4.1)
  • spatstat.univar(v.3.1-1)
  • R.utils(v.2.12.3)
  • lazyeval(v.0.2.2)
  • alabaster.base(v.1.4.2)
  • pander(v.0.6.5)
  • shiny(v.1.9.1)
  • htmltools(v.0.5.8.1)
  • sctransform(v.0.4.1)
  • rappdirs(v.0.3.3)
  • glue(v.1.8.0)
  • spam(v.2.11-0)
  • httr2(v.1.0.7)
  • XVector(v.0.44.0)
  • rprojroot(v.2.0.4)
  • gridExtra(v.2.3)
  • igraph(v.2.1.1)
  • R6(v.2.5.1)
  • labeling(v.0.4.3)
  • cluster(v.2.1.6)
  • Rhdf5lib(v.1.26.0)
  • DelayedArray(v.0.30.1)
  • tidyselect(v.1.2.1)
  • vipor(v.0.4.7)
  • xml2(v.1.3.6)
  • car(v.3.1-3)
  • future(v.1.34.0)
  • rsvd(v.1.0.5)
  • munsell(v.0.5.1)
  • KernSmooth(v.2.23-22)
  • htmlwidgets(v.1.6.4)
  • rlang(v.1.1.4)
  • spatstat.sparse(v.3.1-0)
  • spatstat.explore(v.3.3-3)
  • fansi(v.1.0.6)
  • beeswarm(v.0.4.0)